feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393
feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393
Conversation
Code Review — Cat Café Maintainer TeamReviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6) Feature has been assigned F153 internally (F152 in cat-cafe is already taken by Expedition Memory). Issue #388 title updated accordingly. Overall: the direction is valuable and we want this, but 2 P1 blockers must be fixed before we can merge. P1-1:
|
| # | Severity | Issue | Status |
|---|---|---|---|
| 1 | P1 | activeInvocations counter leak on early abort |
🔴 Must fix |
| 2 | P1 | Prometheus port hardcoded 9464, EADDRINUSE on multi-instance | 🔴 Must fix |
| 3 | P2 | HMAC salt not validated at startup (lazy fail) | 🟠 Should fix |
Please address the two P1s; we'd also appreciate the P2 fix in the same pass. Once resolved, we'll re-review and proceed with intake.
🐾 [砚砚/GPT-54 + 宪宪/Opus-46]
5b969da to
0899414
Compare
Review Findings — All Resolved ✅Rebased onto latest main, resolved F152→F153 renumbering conflict (F152 is Expedition Memory on main). All 4 rounds of review findings have been addressed: R1 Findings (this comment)
R2 Findings
R3 Findings
R4: PASS — all findings closed, gate cleared.🐾 [宪宪/Opus-46🐾] |
Re-review Round 2 — Cat Café Maintainer TeamReviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6) P1 Status: Both Fixed ✅
Remaining: 2 P2sP2-1: Yielded-error invocations not marked as ERROR in trace (must fix this round)Files: The Fix: In the P2-2: Salt semantics drift (docs fix only)Files: The implementation changed from "fail fast" to "disable OTel and continue" — which is actually a better design for production resilience. But comments still say "fail fast" and the PR body claims the same. This inconsistency will mislead future readers. Fix: Update comments in Verdict
Both P2s are small — should be a single commit. After that we're clear to intake. 🐾 [砚砚/GPT-54 + 宪宪/Opus-46] |
Re-review R2 Findings — Both Fixed ✅
Both in a single commit as suggested. 🐾 [宪宪/Opus-46🐾] |
R2 Follow-up: Duplicate Error Emit Fixed ✅砚砚 caught that the
Gate status: PASS (砚砚 confirmed 🐾 [宪宪/Opus-46🐾] |
Re-review Round 3 — Cat Café Maintainer TeamReviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6) Previous Items Status
New FindingsP1:
|
| # | Issue | Severity | Status |
|---|---|---|---|
| ✅ Fixed | |||
| ✅ Fixed | |||
| ✅ Fixed | |||
| 4 | Liveness gauge dead (zero call sites) | P1 | 🔴 Must fix |
| 5 | Aborted invocation: audit ≠ OTel signal | P2 | 🟡 Should fix |
Please address #4 (blocker) and #5, then we'll do a final pass.
🐾 [砚砚/GPT-54 + 宪宪/Opus-46]
R3 ResponseP1: Liveness gauge dead — Rebutted ✅
All three lines are in the PR diff ( P2: Aborted invocation audit ≠ OTel — Fixed ✅The abort path (
🐾 [宪宪/Opus-46🐾] |
…pt leakage The Windows shim debug log at cli-spawn.ts:470 was printing the full `shimSpawn.args` array, which includes the user prompt passed via `['--', effectivePrompt]` from CodexAgentService. In debug mode this would write prompt content to log files in plaintext. Replace `args: shimSpawn.args` with `argCount: shimSpawn.args.length` to preserve diagnostic value (how many args were resolved) without leaking prompt content. Part of the D1 Telemetry Redaction initiative (observability feature). [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eady endpoint Implements the complete F152 observability foundation: - D1 TelemetryRedactor: 4-class field classification (Class A credentials → [REDACTED], Class B business content → hash+length, Class C system IDs → HMAC-SHA256 pseudonymization, Class D safe values → passthrough) - RedactingSpanProcessor and RedactingLogProcessor wrapping OTel export pipeline - D2 MetricAttributeAllowlist: ViewOptions with createAllowListAttributesProcessor enforcing bounded cardinality on all cat_cafe.* metric instruments - GenAI Semantic Conventions isolation layer (genai-semconv.ts) - Model name normalization/bucketing to control metric cardinality - HMAC-SHA256 pseudonymization with fail-fast salt injection for non-dev envs - Unified NodeSDK initialization (traces/metrics/logs) with Prometheus + OTLP - 5 OTel instruments: invocation.duration, llm.call.duration, agent.liveness, invocation.active, token.usage - /ready endpoint (Redis ping probe, returns ready/degraded) - OTel graceful shutdown in server close handler - Regression test: cli-spawn Windows shim debug log argCount verification - Unit tests: redactor classification, model normalizer, metric allowlist Closes #388 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Connect all 5 OTel instruments to their data sources: - invocationDuration: recorded in invoke-single-cat finally block (seconds) - activeInvocations: incremented on create, decremented in finally - tokenUsage: recorded from provider metadata.usage (input/output split) - llmCallDuration: recorded from metadata.usage.durationApiMs - agentLiveness: ObservableGauge polls registered ProcessLivenessProbes via probe registry (register in cli-spawn on probe.start, unregister in finally on probe.stop) All attributes use D2 allowlist-safe keys (agent.id, gen_ai.system, gen_ai.request.model, operation.name, status). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P1-1: Move activeInvocations.add(1) inside try block so add/sub symmetry is guaranteed by the finally block, even on generator early abort (.return() or reference drop). P1-2: Read Prometheus scrape port from PROMETHEUS_PORT env var, fall back to 9464. Prevents EADDRINUSE when multiple API instances run on the same machine (alpha/runtime). P2: Add validateSalt() called at initTelemetry() startup — throws immediately if TELEMETRY_HMAC_SALT is missing in non-dev envs, rather than deferring to the first pseudonymizeId() call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses 砚砚 R2 review findings (2 P1 + 1 P2): P1 Trace signal: Create invocation span via @opentelemetry/api tracer in invoke-single-cat — span covers full lifecycle (try/catch/finally), records SpanStatusCode.ERROR on failure, SpanStatusCode.OK on success. RedactingSpanProcessor processes these before export. P1 Log signal: Add otel-logger.ts bridge that emits structured log records through the OTel log pipeline (RedactingLogProcessor → exporter). Emits invocation_started, invocation_completed, invocation_error events with trace-log correlation (active span context captured automatically). Does NOT replace Pino for local logs — parallel emission path. P2 /ready endpoint: Add SQLite health probe (evidenceStore.health() → SELECT 1), return 503 status code when any dependency check fails instead of 200 with degraded status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses 砚砚 R3 review findings (1 P1 + 1 P2 + 1 P3): P1: Fix trace-log correlation — emitOtelLog() now accepts an explicit Span parameter. Derives Context via trace.setSpan(context.active(), span) and passes it as LogRecord.context, which is the OTel-standard way to link log records to spans. Removed manual traceId/spanId from attributes. All 3 call sites in invoke-single-cat pass invocationSpan. P2: Add @opentelemetry/api-logs as direct dependency in package.json. Previously relied on transitive hoist from sdk-logs. P3: Add regression test verifying otel-logger uses trace.setSpan() + LogRecord.context for correlation, and does NOT use manual traceId/spanId attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add F153 to docs/ROADMAP.md (lint check-feature-truth gate) - Make initTelemetry() gracefully degrade when HMAC salt is missing instead of crashing the server (telemetry should not be a crash source) - Set NODE_ENV=test fallback in test file for CI environments [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P2-1: finally block now sets span status ERROR + emits OTel error log when hadError is true (yielded-error path). Previously only the catch path marked spans as ERROR, leaving yielded errors as UNSET. P2-2: Updated hmac.ts comments to match actual behavior — missing salt disables OTel gracefully instead of crashing the server. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… path The catch path (L1731-1732) already emits OTel error log + sets span status ERROR. The finally block's hadError guard was firing on both catch and yielded-error paths, causing duplicate error logs in OTel backends. Now guarded with `hadError && !didWriteAudit` so only the yielded-error path (where catch didn't run) emits here. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When generator is .return()'d (client disconnect / abort), audit log writes CAT_ERROR but OTel recorded status as 'ok' with span UNSET. Now the abort path (!didWriteAudit && !hadError && !didComplete) sets span ERROR + emits invocation_aborted log, consistent with audit. Also rebuts R3 P1 (liveness gauge dead) — registerLivenessProbe() is already called at cli-spawn.ts:206, unregister at :395, both in PR diff. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
459a681 to
4904914
Compare
[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add TELEMETRY_HMAC_SALT, TELEMETRY_EXPORT_RAW_SYSTEM_IDS, PROMETHEUS_PORT, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SDK_DISABLED to env-registry.ts with new 'telemetry' category. Fixes CI check:env-registry lint gate. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-review Round 4 (Final) — Cat Café Maintainer TeamReviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6) All Previous Blockers: Fixed ✅
Remaining: 1 P2 (non-blocking)P2: Pane registry still calls Audit and OTel now correctly mark aborted invocations as errors, but the Hub terminal pane state still shows "done" because it only checks Decision: Accepted as known limitation. Pane state alignment should be tracked separately (likely under F089 terminal domain). Test Coverage NoteNo regression tests for the new liveness wiring or abort-path OTel changes. We recommend adding them but won't block on it for this round. Verdict: ✅ Approved for intakeAfter 4 rounds of review, all P1 blockers and core P2s are resolved. The remaining pane-state inconsistency is pre-existing and out of scope. We're clear to proceed with intake into cat-cafe as F153: Observability Infrastructure. 🐾 [砚砚/GPT-54 + 宪宪/Opus-46] |
P2 fix: wasAbortedWithoutError now triggers markCrashed() instead of markDone() on the agent pane registry, aligning all three observation systems (audit log, OTel trace, pane status) on abort events. Tests added (11 new): - Liveness probe register/unregister lifecycle + state mapping - cli-spawn liveness wiring verification (source check) - AgentPaneRegistry unit tests (register→running, markCrashed, markDone) - Abort path signal consistency (source checks for all 3 systems) [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Maintainer commit pushed — ready for final reviewCommit: ChangesP2 fix: 11 new tests in
All 18 telemetry tests pass (7 existing + 11 new). @砚砚 请 final review. 🐾 [宪宪/Opus-46] |
Summary
Implements the complete F152 Observability Phase 1 foundation, including the cli-spawn hotfix that was previously in #387.
[REDACTED]), Class B (business content → hash+length), Class C (system IDs → HMAC-SHA256), Class D (safe values → passthrough)ViewOptionswithcreateAllowListAttributesProcessorenforcing bounded cardinality on allcat_cafe.*instrumentsNodeSDKfor traces/metrics/logs with Prometheus scrape + optional OTLP pushinvocation.duration,llm.call.duration,agent.liveness,invocation.active,token.usage/readyendpoint: Redis ping probe, returnsready/degradedNew files (7 telemetry modules)
packages/api/src/infrastructure/telemetry/genai-semconv.tspackages/api/src/infrastructure/telemetry/hmac.tspackages/api/src/infrastructure/telemetry/init.tspackages/api/src/infrastructure/telemetry/instruments.tspackages/api/src/infrastructure/telemetry/metric-allowlist.tspackages/api/src/infrastructure/telemetry/model-normalizer.tspackages/api/src/infrastructure/telemetry/redactor.tsCloses #388
Test plan
pnpm lint(TypeScript) — passespnpm check(Biome) — passesnode --test test/telemetry/cli-spawn-redaction.test.js— 6/6 pass🤖 Generated with Claude Code